Dataset Splitting Without Data Leakage

Data Splitting Key Points

This may seem like a trivial topic. However, it is often done incorrectly, and mistakes here can cause significant issues downstream.

Imagine that you spent significant time working on a model only to find out that your results are invalid because the data was not prepared correctly. This sometimes happens when data is rushed through processing without enough testing and validation of the data and of the preparation steps before modeling.

Challenges:

Data Leakage: Inadvertently sharing data between test and training datasets.

Data leakage is a serious problem because your model will appear to perform very well during training and evaluation, then fail in production.

An example of where this can occur is a longitudinal dataset in which encounters from the same patient are placed in different splits. You may inadvertently leak information about a patient you will be testing on into your training data, essentially giving your model some of the answers. Preventing data leakage is therefore very important to ensure your model can generalize in production.
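One way to avoid the longitudinal leak described above is to split on patients rather than on rows. The following is a minimal sketch, assuming a pandas DataFrame with one row per encounter; the column name `patient_id`, the helper `split_by_patient`, and the 60/20/20 fractions are hypothetical choices for illustration, not part of the lesson.

```python
import numpy as np
import pandas as pd


def split_by_patient(df, patient_col="patient_id",
                     train_frac=0.6, val_frac=0.2, seed=0):
    """Split rows so that every patient lands in exactly one partition."""
    # Shuffle the unique patient identifiers, not the individual rows.
    rng = np.random.default_rng(seed)
    patients = rng.permutation(df[patient_col].unique())

    n_train = int(len(patients) * train_frac)
    n_val = int(len(patients) * val_frac)
    train_ids = patients[:n_train]
    val_ids = patients[n_train:n_train + n_val]
    test_ids = patients[n_train + n_val:]

    # All encounters of a patient follow that patient into one partition.
    train_df = df[df[patient_col].isin(train_ids)]
    val_df = df[df[patient_col].isin(val_ids)]
    test_df = df[df[patient_col].isin(test_ids)]
    return train_df, val_df, test_df


# Hypothetical toy data: 10 patients with 3 encounters each.
encounters = pd.DataFrame({
    "patient_id": np.repeat(np.arange(10), 3),
    "encounter_id": np.arange(30),
})
train_df, val_df, test_df = split_by_patient(encounters)
```

Because the shuffle is applied to patient identifiers, the partition sizes are exact in patients but only approximate in rows when patients have different numbers of encounters.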

Additional Resources

Kaggle Preventing Data Leakage

Challenges continued

Representative Splitting: Having accurate labels and demographics in your data splits that reflect the real world.
A major challenge for most machine learning problems is generalization, which depends on building a dataset that is representative.
Common errors that can occur include:

Non-representative distribution of your label across the splits
Non-representative demographics

Example: Only female patients in training and only male patients in testing

This would introduce unintended bias: the model would be tested on a population it never saw in training, which is especially problematic for procedures that are gender-specific.
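A quick way to spot the non-representative splits described above is to compare the proportion of each label or demographic value across partitions. This is a minimal sketch, assuming pandas DataFrames; the helper `compare_proportions`, the column names `gender` and `label`, and the toy data are hypothetical.

```python
import pandas as pd


def compare_proportions(splits, column):
    """Return a table of value proportions for `column`, one column per split."""
    table = pd.DataFrame({
        name: part[column].value_counts(normalize=True)
        for name, part in splits.items()
    })
    # A value missing from a split has a proportion of zero there.
    return table.fillna(0.0)


# Hypothetical toy partitions reproducing the example above:
# only female patients in training, only male patients in testing.
train_df = pd.DataFrame({"gender": ["F", "F", "F", "F"], "label": [1, 0, 1, 0]})
test_df = pd.DataFrame({"gender": ["M", "M"], "label": [1, 0]})
splits = {"train": train_df, "test": test_df}

gender_table = compare_proportions(splits, "gender")
label_table = compare_proportions(splits, "label")
print(gender_table)
print(label_table)
```

In this toy case the label proportions match across the two partitions while the gender proportions do not, which shows why both the label and the demographics are worth checking.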

Testing and Validating Dataset Splitting

It is important to have some ways to assess whether you have split your data correctly.

Here are a few ways to do this.

Check that a single patient's data is not in more than one partition, to avoid possible data leakage.
Check that the total number of unique patients across the splits is equal to the total number of unique patients in the original dataset. This ensures that no patient information is lost in the splitting and that the counts are correct.
Check that the total number of rows in the original dataset is equal to the sum of rows across all three dataset partitions.

len(original_df) == len(train_df) + len(val_df) + len(test_df) should evaluate to True.
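The three checks above can be bundled into one helper. This is a minimal sketch, assuming pandas DataFrames that share a patient identifier column; the name `check_split`, the column name `patient_id`, and the toy data are hypothetical.

```python
import pandas as pd


def check_split(original_df, train_df, val_df, test_df, patient_col="patient_id"):
    """Run the three split checks and return each result as a boolean."""
    parts = [train_df, val_df, test_df]
    ids = [set(part[patient_col]) for part in parts]
    return {
        # 1. No patient appears in more than one partition.
        "no_patient_overlap": (
            ids[0].isdisjoint(ids[1])
            and ids[0].isdisjoint(ids[2])
            and ids[1].isdisjoint(ids[2])
        ),
        # 2. No patient was lost: unique patients across splits match the original.
        "patient_count_matches": (
            len(ids[0] | ids[1] | ids[2]) == original_df[patient_col].nunique()
        ),
        # 3. No rows were lost or duplicated.
        "row_count_matches": len(original_df) == sum(len(part) for part in parts),
    }


# Hypothetical toy data: 3 patients with 2 encounters each.
original_df = pd.DataFrame({"patient_id": [1, 1, 2, 2, 3, 3]})

# Patient-level split: each patient stays in one partition.
good = check_split(
    original_df,
    original_df[original_df["patient_id"] == 1],
    original_df[original_df["patient_id"] == 2],
    original_df[original_df["patient_id"] == 3],
)

# Row-level split: patients 2 and 3 straddle two partitions.
leaky = check_split(
    original_df,
    original_df.iloc[:3],
    original_df.iloc[3:5],
    original_df.iloc[5:],
)
print(good)
print(leaky)
```

Note that the row-level split passes the row count and patient count checks and fails only the overlap check, so no single check is enough on its own.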

Dataset Splitting

SOLUTION:

If splitting is not done properly, you can inadvertently leak test data into training.
Having representative data across your partitions will ensure your model generalizes better in production.

Dataset Splitting Part 2

SOLUTION:

Making sure that the length of the original dataset is equal to the sum of the lengths of the split partitions.

Dataset Splitting Part 3

SOLUTION:

Explore your data before and after the splits.
Check that a single patient's data is not in more than one partition.
Check that the total number of unique patients across the splits is equal to the total number of unique patients in the original dataset.
Check that the length of the original dataset is equal to the sum of the lengths of the split partitions.